Day 17 - Regular expressions - Groups

$ echo "alpha10" | sed -r s,"([a-z]+([0-9]+))","Full code:\1 - Number:\2",

Full code:alpha10 - Number:10

you will realise that the inner group ([0-9]+) matches the digits 10, while the outer group matches
both the letters and the digits, behaving as if it were ([a-z]+[0-9]+), without the internal group.
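The numbering of nested groups can be checked with a small sketch. Groups are numbered by the position of their opening parenthesis, so the outer group comes first. The function name nested_groups and the extra group around the letters are illustrative additions, not part of the original example.

```shell
# Hypothetical helper: labels each group captured from the argument.
# Group 1 is the outer group, group 2 the letters, group 3 the digits,
# because groups are numbered by the order of their opening parentheses.
nested_groups() {
    echo "$1" | sed -r 's/(([a-z]+)([0-9]+))/whole:\1 letters:\2 digits:\3/'
}

nested_groups "alpha10"
```

With the input alpha10 this should print whole:alpha10 letters:alpha digits:10.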

When you use groups in the replacement string you are not forced to keep them in order, so for
example

$ echo "First,Second" | sed -r s/"(.*),(.*)"/"\2,\1"/

Second,First

is a very simple way to swap two fields in a comma-separated string. Well, maybe you don’t think
it is that simple, so let’s review it together. First of all, I used a / to separate the search and
replacement strings because the search string contains a comma. The first group matches everything
up to the comma, after which the second group captures the rest. (Since .* is greedy, a string with
several commas would be split at the last one.) In the replacement string I print the content of the
second group, then a comma, and then the content of the first group.
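The swap works cleanly with two fields, but it is worth seeing what happens with three. The sketch below compares the greedy pattern with a variant that stops at the first comma by using [^,]* for the first group. The function names swap_greedy and swap_first and the input a,b,c are made up for this illustration.

```shell
# Greedy first group: .* grabs as much as it can, so the split
# happens at the LAST comma of the line.
swap_greedy() {
    echo "$1" | sed -r 's/(.*),(.*)/\2,\1/'
}

# First group cannot contain commas, so the split happens
# at the FIRST comma of the line.
swap_first() {
    echo "$1" | sed -r 's/([^,]*),(.*)/\2,\1/'
}

swap_greedy "a,b,c"
swap_first "a,b,c"
```

The first call should print c,a,b and the second b,c,a. On the two-field input First,Second both behave the same way.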

We can use groups in grep as well, even though the reuse of the matched values is obviously
limited by the fact that grep doesn’t replace text. The best use of groups in grep is with the
so-called lookaround expressions, which are provided by the Perl-compatible regular expression
syntax (PCRE), activated by the -P switch, as opposed to the -E switch that we have used so far,
which activates the Extended syntax (ERE). The differences between these syntaxes are outside the
scope of this book, so feel free to investigate the matter online.
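Even without replacement, a captured value can be reused inside the pattern itself through a backreference. The following is a minimal sketch that assumes GNU grep, which accepts backreferences with -E as an extension (POSIX does not define them for the Extended syntax). The function name doubled_letters and the word list are invented for the example.

```shell
# Hypothetical helper: prints the words that contain the same
# lowercase letter twice in a row. \1 must match exactly the text
# captured by the first group.
doubled_letters() {
    printf '%s\n' "hello" "world" "coffee" | grep -E '([a-z])\1'
}

doubled_letters
```

This should print hello and coffee, while world is skipped because it has no doubled letter.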

Lookaround expressions can provide information about the surroundings of a matching pattern
without including the surroundings themselves. Let’s look at an example: the simplest form of
lookaround is the positive lookahead, where a group specifies what should follow the matching
part of the string

$ cat examples.txt | grep -P "[A-Za-z ]+(?=[0-9]+)"

Police 101
H2O
R2-D2
Johnny 5
Cyborg 009

Here, the matching regular expression is [A-Za-z ]+, so strings of lowercase letters, uppercase letters,
and spaces. We are, however, interested only in those strings that are also followed by one or more
digits, (?=[0-9]+). The effect is clear if you have coloured output in your shell, or if you use the
-o option
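To make the difference visible without colours, the sketch below runs the lookahead next to a plain pattern that consumes the digits. It assumes a grep built with PCRE support (needed for -P). The sample lines and the function names are hypothetical, since the real content of examples.txt is not reproduced here.

```shell
# Hypothetical sample lines standing in for examples.txt.
sample() {
    printf '%s\n' "Police 101" "H2O" "R2-D2" "Johnny 5" "Hello world"
}

# Lookahead: the digits must follow the match, but they are not part
# of it, so -o prints only the letters and spaces.
lookahead_matches() {
    sample | grep -oP "[A-Za-z ]+(?=[0-9]+)"
}

# Plain pattern: the digits are part of the match, so -o prints them too.
consuming_matches() {
    sample | grep -oP "[A-Za-z ]+[0-9]+"
}

lookahead_matches
consuming_matches
```

The first function should print Police, H, R, D and Johnny on separate lines (Police and Johnny keep the trailing space, because the space belongs to the character class). The second should print Police 101, H2, R2, D2 and Johnny 5. The line Hello world never appears, as nothing in it is followed by a digit.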